A Cache Related Preemption Delay Analysis for Multi-level Non-inclusive Caches
domain. Assume that M represents the set of all memory blocks and Dc captures the set of inclusion patterns of a memory block in a two-level cache hierarchy. The domain of the analysis (D) is the set of all valid mappings from M to Dc, as follows:

D : M → (Dc ∪ {⊤})    (1)

with

Dc = {0, 1, …, K, ∞} × {0, 1, …, K, ∞}    (2)

⊤ is an additional element in the abstract domain to capture uncertain information during the analysis, K = max(K1, K2), and ∞ captures all numbers ≥ K + 1. Therefore, ∞ abstracts away all scenarios where a specific memory block is not present in a certain cache level.

Transfer operation at each program point. We first perform must-cache analysis on the preempted task, using the approach of [Theiling et al. 2000] for the L1 cache and that of [Hardy and Puaut 2008] for the L2 cache. As an outcome of the must-cache analysis, we obtain the abstract cache content at each program point. According to the must-cache analysis, let us assume that the tuple MustAge_{m,p} = 〈age1, age2〉 captures the LRU ages of memory block m (in both cache levels) immediately before program point p. If m is not present in the L1 cache, MustAge_{m,p} is captured as a tuple 〈∞, age2〉. Similarly, if m is not present in the L2 cache, MustAge_{m,p} is captured as a tuple 〈age1, ∞〉.

The transfer function τb modifies the abstract state (i.e. an element of the abstract domain D) at each program point. Since the direction of the analysis is backward, the transfer function takes the abstract state after a program point as input and computes the abstract state before the program point as output. Formally, the transfer function τb is defined as follows:

τb : D × P → D
τb(D, p) = D[mp ↦ MustAge_{mp,p}]    (3)

P denotes the set of all program points and mp captures the memory block accessed at program point p. D captures the abstract state after the program point p. D[mp ↦ MustAge_{mp,p}] updates the mapping of memory block mp (i.e. the memory block accessed at program point p) to mp ↦ MustAge_{mp,p}. The mappings of all memory blocks other than mp remain unchanged after applying the transfer function τb.

Join operation to merge multiple abstract states. Since programs usually contain branches and loops, an abstract join operation is used to combine multiple abstract states at control-flow merge points. To define the join operation, we need to define a partial order on Dc. Recall that Dc captures the set of possible inclusion patterns of a memory block in the two-level cache hierarchy. In the following, we first define two basic operations (∆ and ⊞) on the domain Dc. These basic operations are required to establish a partial order among the elements of Dc.

Let us consider a tuple cu = 〈cu1, cu2〉 ∈ Dc. For some memory block m ∈ M, assume that the mapping m ↦ 〈cu1, cu2〉 belongs to some abstract state during the backward-flow analysis. We use a function ∆ to compute the latency of fetching the memory block m. ∆ is defined as follows:

∆ : Dc → ℕ
∆(〈cu1, cu2〉) = 0,            if cu1 ≠ ∞ ∧ cu1 ≤ K1;
                LAT1,          else if cu2 ≠ ∞ ∧ cu2 ≤ K2;
                LAT1 + LAT2,   otherwise.    (4)

Recall that LAT1 and LAT2 denote the fixed L1 and L2 cache miss penalties, respectively.
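To make the domain concrete, the following minimal Python sketch models an element of Dc as a pair of LRU ages, with math.inf standing for ∞, and implements the latency function ∆ of Equation (4). The parameter values K1, K2, LAT1 and LAT2 are illustrative placeholders, not values from the paper.

```python
import math

# Illustrative cache parameters (assumptions, not from the paper):
# K1, K2 are the L1/L2 associativities; LAT1, LAT2 the miss penalties.
K1, K2 = 2, 4
LAT1, LAT2 = 10, 100
INF = math.inf  # plays the role of the element "infinity" in Dc

def delta(cu):
    """Latency of fetching a block with inclusion pattern cu = (cu1, cu2),
    following Equation (4): L1 hit -> 0, L2 hit -> LAT1, else LAT1 + LAT2."""
    cu1, cu2 = cu
    if cu1 != INF and cu1 <= K1:
        return 0                 # guaranteed L1 hit
    if cu2 != INF and cu2 <= K2:
        return LAT1              # L1 miss, L2 hit
    return LAT1 + LAT2           # miss in both cache levels

print(delta((1, 3)))      # 0   (present in L1)
print(delta((INF, 2)))    # 10  (present only in L2)
print(delta((INF, INF)))  # 110 (present in neither level)
```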
The operation ⊞ captures element-wise addition in the domain Dc. Assume cu = 〈cu1, cu2〉 ∈ Dc and cv = 〈cv1, cv2〉 ∈ Dc. For the sake of simplicity, ⊞ is defined using infix notation:

⊞ : Dc × Dc → Dc
〈cu1, cu2〉 ⊞ 〈cv1, cv2〉 = 〈cuv1, cuv2〉    (5)

where

cuvi = ∞,           if cui = ∞ ∨ cvi = ∞ ∨ cui + cvi > Ki;
       cui + cvi,   otherwise.    (6)

Note that the addition operation ⊞ saturates (at the element ∞) instead of overflowing. Given cu1, cu2 ∈ Dc, the partial order can be described informally as follows: cu1 ⊑ cu2 if and only if cu2 leads to more cache reload latency than cu1 in the presence of any additional cache conflict ce. Therefore, the partial order is captured by the following logical equivalence:

cu1 ⊑ cu2 ⇔ ∀ce ∈ Dc. ∆(cu1 ⊞ ce) − ∆(cu1) ≤ ∆(cu2 ⊞ ce) − ∆(cu2)    (7)

However, it is possible that neither cu1 ⊑ cu2 nor cu2 ⊑ cu1. Therefore, we introduce a join semi-lattice Dc ∪ {⊤} to define the least upper bound operator. ⊤ captures uncertain information during the analysis and therefore, for any cu ∈ Dc, cu ⊑ ⊤. We can now define the least upper bound operator ⊔ on the set Dc ∪ {⊤} as follows:

⊔ : (Dc ∪ {⊤}) × (Dc ∪ {⊤}) → Dc ∪ {⊤}
⊔(cu1, cu2) = ⊤,   if cu1 = ⊤ ∨ cu2 = ⊤;
              cu2, if cu1 ⊑ cu2;
              cu1, if cu2 ⊑ cu1;
              ⊤,   otherwise.    (8)

The abstract join operation in our backward-flow analysis merges two abstract states from the abstract domain D. Assume D1, D2 ∈ D. For a memory block m ∈ M, let D1(m) = cu1 and D2(m) = cu2. After the join operation, the least upper bound of cu1 and cu2 (i.e. ⊔(cu1, cu2) as defined in Equation (8)) is mapped to memory block m. Therefore, the formal definition of the join operation ĴD is as follows:

ĴD : D × D → D
ĴD(D1, D2) = ⋃_{m ∈ M} {m ↦ ⊔(cu1, cu2) | D1(m) = cu1 ∧ D2(m) = cu2}    (9)

Initialization. We start our backward-flow analysis with the abstract state {m ↦ 〈∞, ∞〉 | m ∈ M}. At each program point, we check the accessed memory block and apply the transfer function τb as described in Equation (3). Since our analysis is a backward-flow analysis, the abstract state at the exit of a basic block is computed by combining all the abstract states at the entry of its successors (via the join operation ĴD of Equation (9)). The analysis terminates when a fixed point is reached at each program point.
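Since Dc is finite, the universally quantified condition of Equation (7) can be checked by plain enumeration. The sketch below continues the previous snippet (reusing the illustrative K1, K2, INF and delta) and implements ⊞, the order ⊑, the least upper bound ⊔, and the pointwise join of Equation (9); TOP is a sentinel standing for ⊤.

```python
from itertools import product

TOP = "TOP"                            # sentinel for the lattice element "top"
K = max(K1, K2)
AGES = list(range(K + 1)) + [INF]      # the age set {0, 1, ..., K, inf}
DC = list(product(AGES, AGES))         # the finite domain Dc

def boxplus(cu, cv):
    """Element-wise saturating addition of Equations (5) and (6)."""
    def add(a, b, k):
        return INF if a == INF or b == INF or a + b > k else a + b
    return (add(cu[0], cv[0], K1), add(cu[1], cv[1], K2))

def leq(cu, cv):
    """Partial order of Equation (7), checked over every conflict ce in Dc."""
    if cv == TOP:
        return True                    # every element is below top
    if cu == TOP:
        return False
    return all(delta(boxplus(cu, ce)) - delta(cu)
               <= delta(boxplus(cv, ce)) - delta(cv)
               for ce in DC)

def lub(cu, cv):
    """Least upper bound on Dc ∪ {TOP} (Equation (8))."""
    if cu == TOP or cv == TOP:
        return TOP
    if leq(cu, cv):
        return cv
    if leq(cv, cu):
        return cu
    return TOP                         # incomparable elements join to top

def join_states(d1, d2):
    """Pointwise join of two abstract states (Equation (9));
    d1 and d2 are assumed to map the same set of memory blocks."""
    return {m: lub(d1[m], d2[m]) for m in d1}
```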
4.3. Forward flow analysis

Forward-flow analysis is primarily required to compute the indirect effect of preemption. Recall that the indirect effect of preemption is potentially caused by memory blocks which were L1 cache hits before the preemption but may access the L2 cache after the preemption (cf. Figure 2). With respect to a program point p, the forward-flow analysis computes a set of memory blocks Mp, where each m ∈ Mp satisfies the following two conditions:

— m must be accessed along one of the paths starting from the entry point of the program and ending at p. We call such references of m reachable references to p.
— At least one of the reachable references of m (w.r.t. p) must be an L1 cache hit in the absence of preemption.

Therefore, the abstract domain of the analysis is the set of all possible subsets of memory blocks accessed in the program (i.e. 2^M). The abstract transfer function τf is applied at each program point. Since the analysis direction is forward, τf takes the abstract state before a program point p (say M ∈ 2^M) as input and computes the abstract state after the program point p as output. Additionally, τf uses the must-cache analysis results [Theiling et al. 2000] to detect L1 cache hits at a particular program point. According to the must-cache analysis, let us assume that MustACS_{p,1} captures the content of the L1 cache immediately before program point p. If mp is the memory block accessed at program point p and mp is contained in MustACS_{p,1} (i.e. the memory access at p is a guaranteed L1 cache hit), τf augments the input abstract state M with mp. The formal definition of τf is as follows:

τf : 2^M × P → 2^M
τf(M, p) = M ∪ {mp},  if mp ∈ MustACS_{p,1};
           M,         otherwise.    (10)

The abstract join operation simply performs a set union of multiple abstract states at a control-flow merge point. Our forward-flow analysis starts with the empty set, and at each program point we apply the transfer function τf (as defined in Equation (10)). Since the direction of the analysis is forward, the abstract state at the entry of each basic block is computed by taking the set union of all the abstract states at the exit of its predecessors.

4.4. Analysis of the preempting task

For an accurate computation of CRPD, we need to compute the set of cache blocks possibly used by the preempting task. The cache blocks used by the preempting task are called evicting cache blocks (ECBs) [Lee et al. 1998]. Since we consider set-associative LRU caches, for each cache set we compute the maximum number of cache blocks used by the preempting task. ECBs can easily be computed by performing may-cache analysis on the preempting task (using [Hardy and Puaut 2008]). The may-cache analysis computes an over-approximation of the cache contents at each program point. Let us consider the exit point e of the preempting task. The analyzed cache content at e must include all possibly accessed memory blocks (subject to the size of the cache) in the preempting task. According to the may-cache analysis, let us assume that MayACS_{e,1}(s) and MayACS_{e,2}(s) denote the contents of L1 and L2 cache set s, respectively, at the exit point e of the preempting task. For each memory block m accessed in the preempted task, we define a tuple CE_m = 〈CE_{m,1}, CE_{m,2}〉. Intuitively, CE_m captures the maximum number of ECBs mapping to the same cache sets as m. Let us assume that memory block m is mapped to cache set S_{m,1} (S_{m,2}) in the L1 (L2) cache. Therefore, we can define CE_m as follows:

CE_m = 〈CE_{m,1}, CE_{m,2}〉 where CE_{m,i} = |MayACS_{e,i}(S_{m,i})|    (11)
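Both the forward transfer function of Equation (10) and the ECB counts of Equation (11) reduce to elementary set operations. Below is a minimal standalone sketch, assuming the must-cache and may-cache fixed points are available as plain Python sets and dicts; all names and values are hypothetical.

```python
def tau_f(m_state, mp, must_l1):
    """Forward transfer function of Equation (10): record block mp only if
    the access is a guaranteed L1 hit (mp is in the L1 must-cache content)."""
    return m_state | {mp} if mp in must_l1 else m_state

def ce_of(set_l1, set_l2, may_acs_l1, may_acs_l2):
    """ECB counts of Equation (11): how many blocks the preempting task may
    leave in the L1/L2 cache sets that block m maps to.
    set_l1, set_l2:           cache-set indices S_{m,1} and S_{m,2} of m
    may_acs_l1, may_acs_l2:   may-cache contents per set at exit point e"""
    return (len(may_acs_l1.get(set_l1, set())),
            len(may_acs_l2.get(set_l2, set())))

# Hypothetical usage: block "m3" maps to L1 set 0 and L2 set 5.
may_l1 = {0: {"u1", "u2"}}         # assumed may-cache result, L1
may_l2 = {5: {"u1", "u2", "u3"}}   # assumed may-cache result, L2
print(ce_of(0, 5, may_l1, may_l2))  # (2, 3)
```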
4.5. Preemption delay computation

In this section, we show the CRPD computation using the results of the backward-flow analysis (Section 4.2), the forward-flow analysis (Section 4.3) and the must-cache analyses [Theiling et al. 2000; Hardy and Puaut 2008]. We start with a few definitions, based on the fixed-point computations of the backward and forward flow analyses.

— CU_{m,p} : After the backward-flow analysis, let D_{b,p} capture the fixed point of the abstract state at program point p. Therefore, D_{b,p} ∈ M → (Dc ∪ {⊤}). We define CU_{m,p} as D_{b,p}(m). As a result, CU_{m,p} ∈ Dc ∪ {⊤}.
— D_{f,p} : After the forward-flow analysis, D_{f,p} captures the fixed point of the abstract state at program point p. Therefore, D_{f,p} ∈ 2^M.

Additionally, we use CE_m = 〈CE_{m,1}, CE_{m,2}〉 to capture the evicting cache blocks (ECBs) conflicting with memory block m (cf. Equation (11)).

4.5.1. Indirect preemption factor. We illustrated earlier that a CRPD analysis based solely on UCBs and ECBs may lead to an unsafe result (cf. Figures 3(a)-(b)). The root cause of such complications is the presence of the indirect effect (cf. Figure 2). Therefore, in the following, we first define a quantity which is crucial to account for the indirect effect of preemption. For the sake of clarity, we first show the CRPD computation with respect to a specific preemption point p. Subsequently, we show the computation of CRPD for an arbitrary preemption point.

Given a preemption point p, we compute a quantity ID_{r,p} for a program point r. Let mr denote the memory block accessed at r. Intuitively, ID_{r,p} captures an over-approximation of the set of memory blocks which may create an indirect preemption effect on mr. Such memory blocks must have been accessed from the L1 cache in the absence of preemption; however, they might be accessed from the L2 cache after preemption. Therefore, any memory block m in ID_{r,p} must satisfy all the conditions stated below (Equation (12) and the sketch that follows formalize them).

— m must be accessed along some path starting from the entry node and ending at r. Additionally, r must be reachable from at least one reference of m that is an L1 cache hit in the absence of preemption. Therefore, m ∈ D_{f,r}.
— m must be accessed after preemption point p and such an access must be an L1 cache hit in the absence of preemption. Therefore, CU_{m,p} ≠ 〈∞, ∞〉. If CU_{m,p} = 〈∞, ∞〉, m is either not accessed after preemption point p, or any immediate access of m beyond p might be an L1 cache miss in the absence of preemption (cf. Section 4.2).
— m might suffer an L1 cache miss due to preemption, and m must map to the same L2 cache set as mr. Let us assume CU_{m,p} = 〈CU_{m,p,1}, CU_{m,p,2}〉. Therefore, CU_{m,p,1} + CE_{m,1} > K1. If m is mapped to the L2 cache set S_{m,2}, we additionally have S_{m,2} = S_{mr,2}.

Aggregating the above description, we can now formally define ID_{r,p} as follows:

ID_{r,p} = {m | m ≠ mr ∧ m ∈ D_{f,r} ∧ S_{m,2} = S_{mr,2} ∧ CU_{m,p} ≠ 〈∞, ∞〉
               ∧ (CU_{m,p} = ⊤ ∨ (CU_{m,p} ≠ ⊤ ∧ CU_{m,p,1} + CE_{m,1} > K1))}    (12)
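As a minimal sketch, Equation (12) becomes a direct set filter once the analysis results are collected into dictionaries. All inputs below are hypothetical hand-made values, not outputs of a real analysis; INF and TOP play the roles of ∞ and ⊤ as in the earlier snippets.

```python
import math

INF = math.inf   # stands for the age "infinity", as in the earlier sketches
TOP = "TOP"      # sentinel for the lattice element "top"

def indirect_factor(r_block, df_r, cu_p, ce, l2_set, k1):
    """Over-approximate the set ID_{r,p} of Equation (12): blocks that may
    exert an indirect preemption effect on r_block.
    df_r   -- forward-analysis fixed point D_{f,r}, a set of blocks
    cu_p   -- backward-analysis fixed point at p: block -> (cu1, cu2) or TOP
    ce     -- block -> (CE_{m,1}, CE_{m,2}) as in Equation (11)
    l2_set -- block -> index of its L2 cache set S_{m,2}"""
    out = set()
    for m in df_r:
        if m == r_block or l2_set[m] != l2_set[r_block]:
            continue                 # m must differ from, yet conflict in L2 with, r_block
        cu = cu_p.get(m, (INF, INF))
        if cu == (INF, INF):
            continue                 # no guaranteed L1 hit after p
        if cu == TOP or cu[0] + ce[m][0] > k1:
            out.add(m)               # preemption may evict m from L1
    return out

# Hypothetical usage with hand-made analysis results:
df_r   = {"m1", "m2", "m3"}
cu_p   = {"m1": (1, 2), "m2": (INF, INF), "m3": TOP}
ce     = {"m1": (2, 1), "m2": (0, 0), "m3": (0, 0)}
l2_set = {"m0": 5, "m1": 5, "m2": 5, "m3": 5}
print(indirect_factor("m0", df_r, cu_p, ce, l2_set, k1=2))  # {'m1', 'm3'}
```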
4.5.2. CRPD computation. For an arbitrary preemption point p, an overview of the preemption delay computation is shown in Figure 5. The computed preemption delay depends on the cache hit-miss categorizations of memory references (i.e. L1 and L2 cache hit/miss) in the absence of preemption. For L1 cache hits in the absence of preemption, it is sufficient to check only the first access of the respective memory block after preemption [Lee et al. 1998], because the respective memory block is reloaded into the L1 cache once it is first accessed after the preemption. For such memory blocks, CRT_{p,1} captures the reload delay from the L2 cache and CRT_{p,2} captures the reload delay from main memory (cf. Figure 5).

Let us now consider the memory references which were L1 cache misses, but L2 cache hits, in the absence of preemption. As evidenced by our example in Figure 3(b), it is insufficient to consider only the first references of such memory blocks after preemption. Therefore, we need to go through all program locations which were L1 cache misses, but L2 cache hits, in the absence of preemption. We distinguish between the first access and all other accesses to such a program location (say r) after preemption. The first access to r after preemption may suffer the L2 cache miss penalty due to the combined effect of intra-task and inter-task L2 cache conflicts. Our examples in Figures 3(a)-(b)
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2013